Matching
Data
We’ll load up the Lalonde data. Our goal here is to see if we can replicate, or at least get close to, the correct estimate in LaLonde’s 1986 paper using observational data. According to his experimental benchmark, the actual effect of the job training program was a $1,794 increase in subjects’ 1978 wages.
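The code in this section assumes the following packages are attached; the `lalonde` data ship with MatchIt. (The exact set of library calls is an assumption on our part, inferred from the functions used below.)

```r
# Assumed setup: these packages supply matchit(), bal.plot()/love.plot(),
# ggpairs(), modelsummary(), and avg_comparisons().
library(MatchIt)          # matching + the lalonde data
library(cobalt)           # balance checking
library(GGally)           # ggpairs()
library(modelsummary)     # regression tables
library(marginaleffects)  # avg_comparisons()
library(dplyr)
library(ggplot2)

data("lalonde", package = "MatchIt")
```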
As usual, a good starting point is to get some basic descriptive stats on our variables.
```r
lalonde |>
  select(treat, educ, married, nodegree, race) |>
  mutate(treat = factor(treat, labels = c("control", "treatment")),
         married = factor(married, labels = c("single", "married")),
         nodegree = factor(nodegree, labels = c("degree", "no degree"))) |>
  ggpairs(aes(fill = treat, color = treat), upper = "blank") +
  theme_bw() +
  scale_fill_brewer(palette = "Dark2") +
  scale_color_brewer(palette = "Dark2")
```

```r
lalonde |>
  select(treat, age, re74, re75, re78) |>
  mutate(treat = factor(treat, labels = c("control", "treatment"))) |>
  ggpairs(aes(fill = treat, color = treat), upper = "blank") +
  theme_bw() +
  scale_fill_brewer(palette = "Dark2") +
  scale_color_brewer(palette = "Dark2")
```

Regression
We’ll start with regression-based estimates. The first model includes no controls, and the second includes controls for age, education, marital status, past income, degree status, and race:
```r
# the uncontrolled regression
mod0 <- lm(re78 ~ treat, data = lalonde)

# with controls added
mod1 <- lm(re78 ~ treat + age + educ + married + re74 + re75 + nodegree + race,
           data = lalonde)

modelsummary(list("treatment only" = mod0,
                  "treatment + controls" = mod1),
             estimate = "{estimate}",
             fmt = fmt_significant(),
             statistic = "conf.int",
             conf_level = .95,
             gof_omit = "F|RMSE|R2$|AIC|Log.Lik.", # remove some model stats
             title = "DV: Wages")                  # add a title
```

DV: Wages

|             | treatment only | treatment + controls |
|-------------|----------------|----------------------|
| (Intercept) | 6984           | -1174                |
|             | [6276, 7693]   | [-5998, 3650]        |
| treat       | -635           | 1548.2               |
|             | [-1926, 655]   | [13.9, 3082.6]       |
| age         |                | 13.0                 |
|             |                | [-50.8, 76.8]        |
| educ        |                | 403.9                |
|             |                | [91.9, 716.0]        |
| married     |                | 407                  |
|             |                | [-959, 1772]         |
| re74        |                | 0.2964               |
|             |                | [0.1819, 0.4108]     |
| re75        |                | 0.2315               |
|             |                | [0.0261, 0.4370]     |
| nodegree    |                | 260                  |
|             |                | [-1404, 1924]        |
| racehispan  |                | 1740                 |
|             |                | [-261, 3740]         |
| racewhite   |                | 1241                 |
|             |                | [-269, 2750]         |
| Num.Obs.    | 614            | 614                  |
| R2 Adj.     | 0.000          | 0.135                |
| BIC         | 12712.0        | 12666.1              |
Matching methods
Propensity Scores
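A propensity score is each unit’s predicted probability of treatment given the covariates. With `distance = "glm"`, matchit() estimates these scores internally with a logistic regression; conceptually, it amounts to something like this sketch (you don’t need to run it yourself):

```r
# Sketch only: matchit(distance = "glm") fits this model internally.
ps_mod <- glm(treat ~ age + educ + married + race + nodegree + re74 + re75,
              family = binomial(), data = lalonde)

# The propensity score is the fitted probability of treatment:
pscores <- predict(ps_mod, type = "response")
```

Nearest-neighbor matching then pairs each treated unit with the control whose score is closest.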
You can get propensity score matches by using `method = "nearest"` and `distance = "glm"`:
```r
psm_matched <- matchit(treat ~ age + educ + married + race +
                         nodegree + re74 + re75,
                       data = lalonde,
                       method = "nearest",
                       distance = "glm")
```

Once we have our matchit object, the cobalt package gives us some options for balance checking. After examining the results, you might want to go back and re-specify the model.
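The balance table below can be reproduced with cobalt’s `bal.tab()`; the 0.1 threshold on standardized mean differences is a common rule of thumb, not a hard cutoff:

```r
# Balance on each covariate, before (Diff.Un) and after (Diff.Adj) matching,
# flagged against a 0.1 standardized-mean-difference threshold.
bal.tab(psm_matched, un = TRUE, thresholds = c(m = .1))
```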
```
Balance Measures
                Type Diff.Un Diff.Adj        M.Threshold
distance    Distance  1.7941   0.9739
age          Contin. -0.3094   0.0718     Balanced, <0.1
educ         Contin.  0.0550  -0.1290 Not Balanced, >0.1
married       Binary -0.3236  -0.0216     Balanced, <0.1
race_black    Binary  0.6404   0.3730 Not Balanced, >0.1
race_hispan   Binary -0.0827  -0.1568 Not Balanced, >0.1
race_white    Binary -0.5577  -0.2162 Not Balanced, >0.1
nodegree      Binary  0.1114   0.0703     Balanced, <0.1
re74         Contin. -0.7211  -0.0505     Balanced, <0.1
re75         Contin. -0.2903  -0.0257     Balanced, <0.1

Balance tally for mean differences
                   count
Balanced, <0.1         5
Not Balanced, >0.1     4

Variable with the greatest mean difference
   Variable Diff.Adj        M.Threshold
 race_black    0.373 Not Balanced, >0.1

Sample sizes
          Control Treated
All           429     185
Matched       185     185
Unmatched     244       0
```
```r
bal.plot(psm_matched)
love.plot(psm_matched)
```

Access the matched data with the `match.data()` function:
```r
psm_data <- match.data(psm_matched)
head(psm_data)
```

Then we can use the resulting balanced data in a regression model:
```r
psm_fit <- lm(re78 ~ treat + age + educ + married + race + nodegree + re74 + re75,
              data = psm_data,
              weights = weights)
```

Finally, we want to estimate our effects with cluster-robust standard errors:
```r
avg_comparisons(psm_fit,
                variables = "treat",
                vcov = ~subclass)
```

```
 Estimate Std. Error    z Pr(>|z|)   S  2.5 % 97.5 %
     1345        719 1.87   0.0614 4.0  -64.2   2754

Term: treat
Type: response
Comparison: 1 - 0
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
```
```r
modelsummary(list("psm" = psm_fit),
             estimate = "{estimate}",
             fmt = fmt_significant(),
             statistic = "conf.int",
             conf_level = .95,
             vcov = ~subclass,
             gof_omit = "F|RMSE|R2$|AIC|Log.Lik.", # remove some model stats
             title = "DV: Wages")                  # add a title
```

DV: Wages

|             | psm                |
|-------------|--------------------|
| (Intercept) | -2582              |
|             | [-9199, 4036]      |
| treat       | 1345               |
|             | [-69, 2759]        |
| age         | 7.8                |
|             | [-79.0, 94.6]      |
| educ        | 602                |
|             | [178, 1027]        |
| married     | -158               |
|             | [-1996, 1680]      |
| racehispan  | 1533               |
|             | [-491, 3558]       |
| racewhite   | 469                |
|             | [-1328, 2267]      |
| nodegree    | 923                |
|             | [-1392, 3239]      |
| re74        | 0.0264             |
|             | [-0.3012, 0.3539]  |
| re75        | 0.221              |
|             | [-0.109, 0.550]    |
| Num.Obs.    | 370                |
| R2 Adj.     | 0.025              |
| BIC         | 7650.3             |
| Std.Errors  | by: subclass       |
Coarsened Exact Matching
For CEM, we’ll often want to manually set some of the cutpoints. There are no set rules here. Visual inspection of the predictors can help, and the MatchIt package has some defaults that can be sensible, but this is partly a judgment call: which values of the independent variables seem like they should be grouped together?
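To see what coarsening does, here is the binning step in base R: `cut()` assigns each value of a continuous variable to an interval defined by the cutpoints, and CEM then matches exactly on those intervals. (Illustration only; matchit() does the binning internally, and its exact interval construction may differ from this sketch.)

```r
# Coarsen years of education at the cutpoints used below (0, 9, 12, 14).
educ <- c(3, 9, 11, 12, 14, 16)
bins <- cut(educ, breaks = c(-Inf, 0, 9, 12, 14, Inf))

# as.integer(bins) returns the bin index each value fell into:
as.integer(bins)  # 2 2 3 3 4 5
```

Two subjects match under CEM only if they land in the same bin on every coarsened variable.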
By default, observations are matched exactly on factor variables. However, you can override this by passing a list of levels to the `grouping` argument:
```r
cem_match <- matchit(treat ~ age + educ + married + race +
                       nodegree + re74 + re75,
                     data = lalonde,
                     method = "cem",
                     estimand = "ATE",
                     cutpoints = list(educ = c(0, 9, 12, 14)),
                     grouping = list(race = list(c("white", "hispan"),
                                                 c("black"))))
summary(cem_match)
```
```
Call:
matchit(formula = treat ~ age + educ + married + race + nodegree +
    re74 + re75, data = lalonde, method = "cem", estimand = "ATE",
    cutpoints = list(educ = c(0, 9, 12, 14)), grouping = list(race = list(c("white",
    "hispan"), c("black"))))

Summary of Balance for All Data:
           Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
age              25.8162       28.0303         -0.2419     0.4400    0.0813   0.1577
educ             10.3459       10.2354          0.0448     0.4959    0.0347   0.1114
married           0.1892        0.5128         -0.7208          .    0.3236   0.3236
raceblack         0.8432        0.2028          1.6708          .    0.6404   0.6404
racehispan        0.0595        0.1422         -0.2774          .    0.0827   0.0827
racewhite         0.0973        0.6550         -1.4080          .    0.5577   0.5577
nodegree          0.7081        0.5967          0.2355          .    0.1114   0.1114
re74           2095.5737     5619.2365         -0.5958     0.5181    0.2248   0.4470
re75           1532.0553     2466.4844         -0.2870     0.9563    0.1342   0.2876

Summary of Balance for Matched Data:
           Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
age              21.1665       20.7539          0.0451     0.8970    0.0144   0.1294          0.1133
educ             10.1082       10.1923         -0.0340     0.7899    0.0153   0.1186          0.3455
married           0.0619        0.0619         -0.0000          .    0.0000   0.0000          0.0000
raceblack         0.5876        0.5876          0.0000          .    0.0000   0.0000          0.0000
racehispan        0.1314        0.0940          0.1256          .    0.0375   0.0375          0.2567
racewhite         0.2809        0.3184         -0.0946          .    0.0375   0.0375          0.1933
nodegree          0.7423        0.7423         -0.0000          .    0.0000   0.0000          0.0000
re74            472.0782      742.1934         -0.0457     1.2640    0.0478   0.2828          0.0847
re75            466.2790      677.2163         -0.0648     1.1178    0.0549   0.2699          0.1654

Sample Sizes:
              Control Treated
All            429.    185.
Matched (ESS)   86.99   32.26
Matched        112.     82.
Unmatched      317.    103.
Discarded        0.      0.
```
Now we can plot our results to check how well the matching performed:
```r
plot(cem_match,
     type = "density",
     interactive = TRUE,
     which.xs = ~age + married + re75)
```

```
Balance Measures
                Type Diff.Un Diff.Adj    M.Threshold
age          Contin. -0.2419   0.0451 Balanced, <0.1
educ         Contin.  0.0448  -0.0340 Balanced, <0.1
married       Binary -0.3236  -0.0000 Balanced, <0.1
race_black    Binary  0.6404   0.0000 Balanced, <0.1
race_hispan   Binary -0.0827   0.0375 Balanced, <0.1
race_white    Binary -0.5577  -0.0375 Balanced, <0.1
nodegree      Binary  0.1114  -0.0000 Balanced, <0.1
re74         Contin. -0.5958  -0.0457 Balanced, <0.1
re75         Contin. -0.2870  -0.0648 Balanced, <0.1

Balance tally for mean differences
                   count
Balanced, <0.1         9
Not Balanced, >0.1     0

Variable with the greatest mean difference
 Variable Diff.Adj    M.Threshold
     re75  -0.0648 Balanced, <0.1

Sample sizes
                     Control Treated
All                   429.    185.
Matched (ESS)          86.99   32.26
Matched (Unweighted)  112.     82.
Unmatched             317.    103.
```
```r
bal.plot(cem_match)
love.plot(cem_match)
```

Finally, we’ll want to use the `match.data()` function to extract the matched data and then estimate our model with the weights included:
```r
cem_data <- match.data(cem_match)

cem_fit <- lm(re78 ~ treat + age + educ + married + re74 + re75 +
                nodegree + race,
              data = cem_data, weights = weights)

avg_comparisons(cem_fit, variables = "treat", vcov = ~subclass)
```
```
 Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
     1672       1079 1.55    0.121 3.0  -444   3787

Term: treat
Type: response
Comparison: 1 - 0
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
```
```r
modelsummary(list("psm" = psm_fit,
                  "cem" = cem_fit),
             estimate = "{estimate}",
             fmt = fmt_significant(),
             statistic = "conf.int",
             conf_level = .95,
             vcov = ~subclass,
             gof_omit = "F|RMSE|R2$|AIC|Log.Lik.", # remove some model stats
             title = "DV: Wages")                  # add a title
```

DV: Wages

|             | psm               | cem               |
|-------------|-------------------|-------------------|
| (Intercept) | -2582             | -4693             |
|             | [-9199, 4036]     | [-13834, 4449]    |
| treat       | 1345              | 1672              |
|             | [-69, 2759]       | [-458, 3801]      |
| age         | 7.8               | 106.3             |
|             | [-79.0, 94.6]     | [-41.9, 254.5]    |
| educ        | 602               | 608               |
|             | [178, 1027]       | [14, 1201]        |
| married     | -158              | -2533             |
|             | [-1996, 1680]     | [-5348, 282]      |
| racehispan  | 1533              | 957               |
|             | [-491, 3558]      | [-422, 2335]      |
| racewhite   | 469               | 2391              |
|             | [-1328, 2267]     | [224, 4559]       |
| nodegree    | 923               | 826               |
|             | [-1392, 3239]     | [-1886, 3538]     |
| re74        | 0.0264            | -0.0964           |
|             | [-0.3012, 0.3539] | [-0.7940, 0.6012] |
| re75        | 0.221             | 0.6872            |
|             | [-0.109, 0.550]   | [-0.0556, 1.4300] |
| Num.Obs.    | 370               | 194               |
| R2 Adj.     | 0.025             | 0.061             |
| BIC         | 7650.3            | 3999.8            |
| Std.Errors  | by: subclass      | by: subclass      |
Question
See if you can improve on the CEM matching used above.
```r
# your code
```